5 Training (Attention Is All You Need)
5.1 Training Data and Batching
5.2 Hardware and Schedule
5.3 Optimizer
We used the Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10⁻⁹.
We varied the learning rate over the course of training, according to the formula (Eq. (3) in the paper):
lrate = d_model^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5))
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number.
「For the first warmup_steps training steps, the learning rate increases linearly.」
「After that, it decreases in proportion to the inverse square root of the step number.」
(Proportional to the inverse = inversely proportional to the original quantity, i.e. lrate ∝ 1/√step_num after warmup?)
(Does the idea of warmup come from prior work?)
We used warmup_steps = 4000.
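As a concrete check of the schedule, here is a minimal sketch of Eq. (3) in plain Python (the function name is mine; d_model = 512 and warmup_steps = 4000 are the base-model values):

def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps, then decay
    # proportional to the inverse square root of the step number.
    step_num = max(step_num, 1)  # guard against step 0
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# The two branches meet at step_num = warmup_steps:
# transformer_lrate(4000) ≈ 512**-0.5 * 4000**-0.5 ≈ 7.0e-4 (the peak rate).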
5.4 Regularization
During training, three kinds of regularization are used.
Residual Dropout (applied in two places)
We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.
「Dropout is applied to the output of each sub-layer, before it is added to the sub-layer input and normalized.」
(In terms of Figure 1, this sits inside each layer of the Encoder and Decoder?)
In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
「In addition, dropout is applied to the sums of the embeddings and the positional encodings, in both the encoder and decoder stacks.」
(In terms of Figure 1, isn't this the point just before the input enters the Encoder and Decoder stacks?)
For the base model, we use a rate of P_drop = 0.1.
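To make the two placements concrete, here is a rough PyTorch sketch (not the authors' code; the class name is mine, and nn.MultiheadAttention stands in for any sub-layer, attention or feed-forward):

import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    # Post-norm residual block as in Figure 1: dropout is applied to the
    # sub-layer output, before the residual add and LayerNorm.
    def __init__(self, d_model=512, n_heads=8, p_drop=0.1):
        super().__init__()
        self.sublayer = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.sublayer(x, x, x)          # stand-in for any sub-layer
        return self.norm(x + self.dropout(out))  # dropout -> add -> normalize

# The second placement, at the input of each stack:
# x = nn.Dropout(0.1)(token_embedding + positional_encoding)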
Label Smoothing
During training, we employed label smoothing of value εls = 0.1.
ε_ls is the ε used for label smoothing (? the symbol does not seem to appear in the referenced paper).
This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
(perplexity = how "perplexed" the model is, i.e. the exponential of the average cross-entropy; a model that is less sure about the next token has higher perplexity.)
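For reference, a sketch of one common way to implement label smoothing with ε_ls = 0.1 (not the authors' code; the function name and the choice of spreading ε_ls over the remaining vocab_size − 1 classes are assumptions):

import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, targets, eps_ls=0.1):
    # Target distribution: 1 - eps_ls on the correct token, and eps_ls spread
    # uniformly over the remaining vocabulary entries.
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps_ls / (vocab_size - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - eps_ls)
    return -(smooth * log_probs).sum(dim=-1).mean()

# Because the targets are no longer one-hot, the loss (and hence perplexity)
# measured against the true labels gets worse, while the softer targets tend
# to help accuracy and BLEU, as the quote above says.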